In practice, the problems we want to solve have large state spaces:
Backgammon: 10^20 states
Chess: 10^40 states
Helicopter: continuous state space
So far we have represented the value function as a table, but for large problems it is not possible to store every state in a table. We need to approximate the value function.
Every state s has an entry in V(s) (prediction)
Every state-action pair (s, a) has an entry in Q(s,a) (control)
For large MDPs, we can't store all the states and/or actions in memory. We need to approximate the value function.
It is also slow to learn the value of each state individually. We need to generalize the value function.
Solution for Large MDPs
Estimate the value function with a parameterized function: $V(s, \mathbf{w}) \approx V_\pi(s)$ (prediction), $Q(s, a, \mathbf{w}) \approx Q_\pi(s, a)$ (control)
Generalize from the training states to the unseen states
Update the weights w to minimize the error using Monte-Carlo (MC) or temporal-difference (TD) learning
Function Approximation
There are many ways to approximate the value function. Some of them are:
Linear function approximation
Neural networks
Decision trees
Nearest neighbors
Fourier basis functions
Tile coding
We consider differentiable function approximators, because we can use gradient-based optimization methods to update the weights. We also need a training method that is suitable for non-stationary, non-i.i.d. data.
Since we are updating the policy, the data is non-stationary. The data is also non-i.i.d. because the states are correlated.
Gradient Descent
Let $J(\mathbf{w})$ be a differentiable objective function that we want to minimize, e.g. the mean-squared error between the true value function and its approximation:

$J(\mathbf{w}) = \mathbb{E}_\pi\left[\left(V_\pi(s) - V(s, \mathbf{w})\right)^2\right]$

Gradient descent updates the weights in the direction of the negative gradient:

$\Delta\mathbf{w} = -\tfrac{1}{2}\alpha \nabla_{\mathbf{w}} J(\mathbf{w}) = \alpha\,\mathbb{E}_\pi\left[\left(V_\pi(s) - V(s, \mathbf{w})\right)\nabla_{\mathbf{w}} V(s, \mathbf{w})\right]$

We can't compute the expectation exactly, so we use a sample to estimate it (stochastic gradient descent):

$\Delta\mathbf{w} = \alpha\left(V_\pi(s) - V(s, \mathbf{w})\right)\nabla_{\mathbf{w}} V(s, \mathbf{w})$
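The sampled update above can be sketched in a few lines. A minimal illustration, assuming a hypothetical one-parameter approximator $V(s, \mathbf{w}) = w_0 s$ and an oracle target value (both made up for the example):

```python
import numpy as np

# Stochastic gradient update: Δw = α (V_target − V(s, w)) ∇_w V(s, w).
# V(s, w) = w0 * s is a toy approximator chosen only for illustration.

def v_hat(s, w):
    return w[0] * s

def grad_v_hat(s, w):
    return np.array([s])  # ∇_w V(s, w) for this one-parameter approximator

alpha = 0.1                # step size
w = np.zeros(1)            # initial weights
s, v_target = 2.0, 5.0     # a sample state and its (oracle) target value

# One SGD step toward the target
w += alpha * (v_target - v_hat(s, w)) * grad_v_hat(s, w)
```

Repeating this step over many sampled states moves w toward the minimizer of the squared error.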
Feature Vector
We represent the state s as a feature vector x(s): a vector of real numbers computed from the state.
For example:
Distance to the goal,
Distance to the obstacles,
Trends in the stock market
Queen and king positions in chess
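A feature vector like the ones above can be sketched for a toy gridworld. The goal position, obstacle positions, and feature choices below are illustrative assumptions, not part of any standard environment:

```python
import numpy as np

# Hypothetical feature vector for a gridworld: the state is an (x, y)
# position; features are distance to the goal, distance to the nearest
# obstacle, and a constant bias feature.

GOAL = np.array([4.0, 4.0])
OBSTACLES = [np.array([2.0, 2.0]), np.array([1.0, 3.0])]

def x(s):
    s = np.asarray(s, dtype=float)
    dist_goal = np.linalg.norm(GOAL - s)
    dist_obstacle = min(np.linalg.norm(o - s) for o in OBSTACLES)
    return np.array([dist_goal, dist_obstacle, 1.0])

features = x((0.0, 0.0))   # feature vector for the state (0, 0)
```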
Linear Function Approximation
The value function is a linear combination of the feature vector:
$V(s, \mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w} = \sum_{j=1}^{n} x_j(s)\, w_j$
where w is the weight vector, and x(s) is the feature vector.
$J(\mathbf{w}) = \mathbb{E}_\pi\left[\left(V_\pi(s) - \mathbf{x}(s)^\top \mathbf{w}\right)^2\right]$

Now the objective function is quadratic in w, so gradient descent can find its global minimum. The gradient and the resulting update rule are:

$\nabla_{\mathbf{w}} V(s, \mathbf{w}) = \mathbf{x}(s)$

$\Delta\mathbf{w} = \alpha\left(V_\pi(s) - \mathbf{x}(s)^\top \mathbf{w}\right)\mathbf{x}(s)$
Since the feature vector appears in the update rule, each weight's update is: step size × prediction error × feature value, so features with larger values at the current state receive larger updates.
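A minimal sketch of one linear-SGD step, with made-up feature values and target:

```python
import numpy as np

# Linear value function V(s, w) = x(s)^T w and its update
# Δw = α (V_target − x(s)^T w) x(s).

alpha = 0.1
w = np.zeros(3)
x_s = np.array([1.0, 0.0, 2.0])   # feature vector of a sample state
v_target = 3.0                     # stand-in for the unknown V_pi(s)

error = v_target - x_s @ w         # prediction error (scalar)
w += alpha * error * x_s           # update = step size × error × feature value
```

Note that the second weight is untouched: its feature value is zero at this state, so only the active features are updated.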
Incremental Prediction Algorithms
In practice we do not know the true value $V_\pi(s)$, so we substitute a target computed from samples: for Monte-Carlo learning the return $G_t$, and for TD(0) the TD target $R_{t+1} + \gamma V(S_{t+1}, \mathbf{w})$.
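The two substitutions can be sketched side by side for the linear case; all numeric values below are illustrative:

```python
import numpy as np

# Incremental prediction with linear features:
#   Monte-Carlo target: the sampled return G_t
#   TD(0) target:       r + gamma * x(s')^T w  (bootstrapped)

alpha, gamma = 0.1, 0.9
w = np.zeros(2)

def v_hat(x_s, w):
    return x_s @ w

def mc_update(w, x_s, g_return):
    # Update toward the sampled return
    return w + alpha * (g_return - v_hat(x_s, w)) * x_s

def td0_update(w, x_s, reward, x_next):
    # Update toward the bootstrapped TD target
    target = reward + gamma * v_hat(x_next, w)
    return w + alpha * (target - v_hat(x_s, w)) * x_s

x_s, x_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w_mc = mc_update(w, x_s, g_return=4.0)
w_td = td0_update(w, x_s, reward=1.0, x_next=x_next)
```

The MC target is unbiased but high-variance; the TD target is biased (it uses the current estimate of the next state's value) but lower-variance.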